Newsletter

Saturday, November 21, 2009 VOLUME 1 ISSUE 16  
Web harvesting: new problems, new solutions.

Chris Buckingham, President, Caesius Software.
chris@caesius.com


**********

Summary: Web harvesting tools -- software for automatically extracting information -- have made it easier to find and use competitor information. As web content becomes more diverse and web sites more sophisticated, web harvesting tools themselves have to adapt. Chris Buckingham reviews the features that are needed to support your CI requirements and deal with the complexity of modern web sites.

**********


The web has experienced rapid growth and success as a public information source, but gathering information from the web can be tedious and time-consuming, even with the help of a search engine.

In the late 90s, along came web harvesting tools - software for automatically extracting information from the web. These tools picked up where search engines left off by automating the eyeballing, copying, and pasting necessary to collect web information for analysis. Very handy gizmos for finding and extracting competitors’ on-line press releases, pulling out specific financial information from EDGAR on-line, or creating a competitive pricing database from on-line catalogs. But as web content becomes more diverse and web sites more sophisticated, a new generation of web harvesting tools is needed to reach this content and navigate these sites.

So…what should this next generation of web harvesting tools contain?


Format flexibility

Web content is no longer just HTML. It resides in a variety of formats - .pdf, .doc, .xls, etc. Once this content is harvested, users want it in the format of their choice, not the tool vendor’s choice (most vendors are in love with XML). Ideally, a web harvesting tool should extract information from internal and external sources in any of these formats, not just HTML, and transform it into a variety of structured formats, not just XML.
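
As a rough illustration of "the format of their choice", the sketch below writes the same harvested records out as CSV, JSON, or XML using only the Python standard library. The function names, the EXPORTERS table, and the sample records are assumptions made up for this example; they do not describe any particular product.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def to_csv(records):
    """Render a list of flat dicts as CSV text (columns taken from the first record)."""
    if not records:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json(records):
    """Render the records as a JSON array."""
    return json.dumps(records, indent=2)


def to_xml(records):
    """Render the records as XML; assumes every key is a valid XML element name."""
    root = ET.Element("records")
    for rec in records:
        item = ET.SubElement(root, "record")
        for key, value in rec.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical lookup table: the user names the output format, not the vendor.
EXPORTERS = {"csv": to_csv, "json": to_json, "xml": to_xml}


def export(records, fmt):
    return EXPORTERS[fmt](records)


# Made-up sample data standing in for harvested catalog entries.
sample = [
    {"product": "Widget", "price": "9.99"},
    {"product": "Gadget", "price": "24.50"},
]
print(export(sample, "csv"))
```

Adding another output format then means adding one function and one table entry, without touching the harvesting side.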


“Deep web” access
 
Search engines and first-generation web harvesting tools can follow static links, but that’s no longer good enough. Many sites now hide information behind dynamic links generated from web forms or embedded in scripting languages. This hidden information is often referred to as the “deep web”. To reach the deep web, tools must automatically discover and fill out web forms and synthesize dynamic links.
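
To make "discover and fill out web forms and synthesize dynamic links" concrete, here is a minimal sketch using Python's standard html.parser and urllib.parse modules. It handles only simple GET forms; the FormFinder class, the synthesize_links function, and the example page are hypothetical, and real sites (POST forms, script-generated links, sessions) need considerably more.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormFinder(HTMLParser):
    """Collects each <form> on a page together with its named fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
        elif tag in ("input", "select", "textarea") and self._current is not None:
            name = attrs.get("name")
            if name:  # unnamed controls (e.g. plain submit buttons) are not sent
                self._current["fields"][name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def synthesize_links(page_url, html, values):
    """Fill discovered GET forms with `values` and return the resulting URLs."""
    finder = FormFinder()
    finder.feed(html)
    links = []
    for form in finder.forms:
        if form["method"] != "get":
            continue  # POST forms need a request body, not a link
        fields = dict(form["fields"])
        fields.update({k: v for k, v in values.items() if k in fields})
        links.append(urljoin(page_url, form["action"]) + "?" + urlencode(fields))
    return links


# Made-up page fragment with a search form.
page = (
    '<form action="/search" method="get">'
    '<input name="q"><input type="hidden" name="lang" value="en">'
    '<input type="submit" value="Go"></form>'
)
links = synthesize_links("http://example.com/catalog/index.html", page, {"q": "press release"})
print(links)
```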


Output consolidation

A CI professional may need to collect information from multiple sites, but wants it consolidated into a single report. Taking it a step further, she may want the harvested information incorporated into an existing database or spreadsheet. Web harvesting tools need to be flexible enough to bring information together from multiple harvests and/or multiple sites into a single format or existing template or worksheet.
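
One simple way to picture consolidation is a single table in which every row records the site it came from. The sketch below is an assumption-laden illustration, not a description of any product: the consolidate function, the "source" column, and the sample data are invented for the example.

```python
import csv
import io


def consolidate(harvests):
    """Merge records from several harvests into one CSV table.

    `harvests` maps a source name to a list of flat record dicts. Columns are
    the union of all keys seen; fields a site did not supply are left blank.
    """
    columns = ["source"]
    rows = []
    for source, records in harvests.items():
        for rec in records:
            for key in rec:
                if key not in columns:
                    columns.append(key)
            rows.append({**rec, "source": source})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up pricing data from two hypothetical sites.
harvests = {
    "site-a": [{"product": "Widget", "price": "9.99"}],
    "site-b": [{"product": "Widget", "price": "10.49", "stock": "12"}],
}
report = consolidate(harvests)
print(report)
```

The resulting CSV text can be opened directly in a spreadsheet or loaded into an existing database table.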


Logical pinpointing

Most first-generation tools require the user to first physically pinpoint (usually with a mouse) the information they want harvested. Then the tool remembers the navigation path and screen position so the extraction is automated for future harvests. It's easy, but useful only for repeatedly harvesting information from a small, fixed number of unchanging pages.

Logical pinpointing, on the other hand, finds the information systematically by matching it against defined criteria (such as a Boolean expression). While it may not be as easy as physical pinpointing, logical pinpointing is very efficient when the information resides on a variable number of changing pages.
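
The idea of pinpointing by a Boolean expression can be sketched in a few lines. In the example below, a criterion is either a search term or a nested tuple combining terms with "and", "or", and "not"; this representation, the function names, and the sample text are assumptions chosen for illustration, not the query syntax of any actual tool.

```python
def matches(criterion, text):
    """Evaluate a Boolean criterion against a block of text.

    A criterion is a string (case-insensitive substring test) or a tuple
    ("and", c1, c2, ...), ("or", c1, c2, ...), or ("not", c).
    """
    if isinstance(criterion, str):
        return criterion.lower() in text.lower()
    op, *args = criterion
    if op == "and":
        return all(matches(arg, text) for arg in args)
    if op == "or":
        return any(matches(arg, text) for arg in args)
    if op == "not":
        return not matches(args[0], text)
    raise ValueError(f"unknown operator: {op!r}")


def pinpoint(blocks, criterion):
    """Return the text blocks that satisfy the criterion, wherever they appear."""
    return [block for block in blocks if matches(criterion, block)]


# Made-up text blocks, as might be pulled from any number of pages.
blocks = [
    "Acme announces Q3 revenue of $12M",
    "Acme hires new CFO",
    "Weather forecast: rain",
]
criterion = ("and", "acme", ("or", "revenue", "earnings"), ("not", "forecast"))
hits = pinpoint(blocks, criterion)
print(hits)
```

Because the criterion says nothing about page position, the same rule keeps working when pages are added, removed, or rearranged.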


Anonymity

Many sites have implemented measures to thwart unwanted visitors. To get around this, some organizations anonymize their digital identities. The same precautions should be taken when using web harvesting tools. Anonymization capabilities should be integrated into the web harvesting tool, not left to the user to figure out.


Performance on demand

Some sites are very slow, particularly during busy periods, so getting all the information needed within a limited time window may be impossible. To address this, a web harvesting tool needs either to provide performance on demand by initiating and managing simultaneous harvests, or to bail out and recommend or schedule the harvest for a less busy time.
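
Both halves of this requirement, simultaneous harvests and bailing out, can be sketched with Python's standard concurrent.futures module (the cancel_futures argument needs Python 3.9 or later). The harvest_within function, the fetch callable it is given, and the fake_fetch stand-in are hypothetical; a real tool would plug in its own page-retrieval code.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def harvest_within(urls, fetch, time_budget_s, workers=4):
    """Run `fetch(url)` for all urls in parallel, giving up after the time budget.

    Returns (results, deferred): the pages fetched in time, and the urls that
    were too slow or failed and should be rescheduled for a less busy period.
    """
    pool = ThreadPoolExecutor(max_workers=workers)
    futures = {pool.submit(fetch, url): url for url in urls}
    done, _ = wait(futures, timeout=time_budget_s)
    # Do not block on stragglers; drop queued work that has not started yet.
    pool.shutdown(wait=False, cancel_futures=True)
    results = {
        futures[f]: f.result()
        for f in done
        if not f.cancelled() and f.exception() is None
    }
    deferred = [url for url in urls if url not in results]
    return results, deferred


def fake_fetch(url):
    """Stand-in for a real download; 'slow' urls simulate a busy site."""
    if "slow" in url:
        time.sleep(1.0)
    return "content of " + url


results, deferred = harvest_within(
    ["fast-1", "slow-1", "fast-2"], fake_fetch, time_budget_s=0.3
)
print(sorted(results), deferred)
```

The deferred list is what the tool would hand to its scheduler for a retry at a quieter time.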


Scheduling

Scheduled harvesting is very useful for two reasons:
  • Running harvests at night can have information organized and ready for analysis in the morning. The productivity of information analysts should increase if collection time is replaced by analysis time.
  • Some sites are updated periodically throughout the day and night. If the information is time critical, it should be harvested as close to the time of update as possible so managers can be alerted to changing business conditions.
On some sites information is updated continually throughout the day (e.g. stock quotes), so the tool should constantly keep an eye on the site. On other sites information is updated at appointed times (e.g. power generation metrics), so the tool may be scheduled to harvest within minutes of the scheduled update.
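
For sites that update at appointed times, the scheduling decision reduces to a small date calculation: pick the next published update time and add a short delay. The sketch below assumes the update times are known in advance; the next_harvest function, the two-minute delay, and the sample times are illustrative assumptions only.

```python
from datetime import datetime, time, timedelta


def next_harvest(now, update_times, delay=timedelta(minutes=2)):
    """Return the next moment to harvest: shortly after the next scheduled update.

    `update_times` is a non-empty list of daily update times published by the site.
    """
    candidates = []
    for day_offset in (0, 1):  # check the rest of today, then tomorrow
        day = now.date() + timedelta(days=day_offset)
        for t in update_times:
            run_at = datetime.combine(day, t) + delay
            if run_at > now:
                candidates.append(run_at)
    return min(candidates)


# Made-up schedule: a site that refreshes its figures three times a day.
updates = [time(6, 0), time(12, 0), time(18, 0)]
print(next_harvest(datetime(2002, 9, 22, 10, 30), updates))
```

A continually updated site such as a stock quote page would instead be polled in a loop at a fixed interval.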

Most CI departments can benefit from a web harvesting tool, but before succumbing to the marketing hype, make sure the features are there to support your requirements and deal with the complexity of today’s web sites.


Background:

Christopher J. Buckingham is president of Caesius Software (www.caesius.com), makers of WebQL, an award-winning web harvesting software solution. He has over 28 years of experience in the computer hardware and software industry.
 

Copyright Society of Competitive Intelligence Professionals www.scip.org


SCIP.online, volume 1 number 16, September 22, 2002